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Adaptive Piecewise Polynomial Estimation via Trend Filtering 

Ryan J. Tibshirani 

Abstract 

We study trend filtering, a recently proposed tool of Kim et al. (2009) for nonparametric 
regression. The trend filtering estimate is defined as the minimizer of a penalized least squares 
p^ criterion, in which the penalty term sums the absolute fcth order discrete derivatives over the 

input points. Perhaps not surprisingly, trend filtering estimates appear to have the structure of 
f->j , fcth degree spline functions, with adaptively chosen knot points (we say "appear" here as trend 

^^ filtering estimates are not really functions over continuous domains, and are only defined over 

^^ the discrete set of inputs). This brings to mind comparisons to other nonparametric regression 

("^ tools that also produce adaptive splines; in particular, we compare trend filtering to smooth- 

^~~^ ing splines, which penalize the sum of squared derivatives across input points, and to locally 

adaptive regression splines (Mammen & van de Geer 1997), which penalize the total variation 

of the fcth derivative. Empirically, we discover that trend filtering estimates adapt to the local 
level of smoothness much better than smoothing splines, and further, they exhibit a remarkable 
similarity to locally adaptive regression splines. We also provide theoretical support for these 
rG empirical findings; most notably, we prove that (with the right choice of tuning parameter) 

frt the trend filtering estimate converges to the true underlying function at the minimax rate for 

a functions whose fcth derivative is of bounded variation. This is done via an asymptotic pairing 

I I of trend filtering and locally adaptive regression splines, which have already been shown to con- 

verge at the minimax rate (Mammen & van de Geer 1997). At the core of this argument is a 
T-H new result tying together the fitted values of two lasso problems that share the same outcome 

^ vector, but have different predictor matrices. 

^D Keywords; trend filtering, nonparametric regression, smoothing splines, locally adaptive regres- 

OO sion splines, minimax convergence rate, lasso stability 
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t:}-' 1 Introduction 

O 

C^ Per the usual setup in nonparametric regression, we assume that we have observations j/i , . . . y„ G M 

^^ from the model 

> Vi = foixi) + e.^, i = l,...n, (1) 

K> where xi, . . .Xn G K are input points, /o is the underlying function to be estimated, and ei, . . . e„ 

i^ are independent errors. For most of this work, we will further assume that the inputs are evenly 

^ spaced over the interval [0, 1], i.e., Xi — i/n for i — 1 . . .n. (In Section 6.1 we consider relaxing this 



assumption.) The literature on nonparametric regression is rich and diverse, and there are many 
methods for estimating /o given observations from the model (fTl); some well-known examples include 
methods based on local polyonimals, kernels, splines, sieves, and wavelets. 

This paper focuses on a relative newcomer in nonparametric regression: trend filtering, proposed 
by Kim et al. (2009). For a given integer A: > 0, the fcth order trend filtering estimate /3 = (/3i, . . . /3„) 
of (/o(a;i), . . . fo{xn)) is defined by a penalized least squares optimization problem, 

P = argmin \\y-p\\l+'^. X\\D^''+^'> p\\,, (2) 

where A > is a tuning parameter, and £)'^'°+^^ S M^""*^)^"- is the discrete difference operator of 
order fc-|- 1. (The constant factor n'^/fc! multiplying A is unimportant, and can of course be absorbed 



into the tuning parameter A, but it will facilitate comparisons in future sections.) When A: = 0, this 
operator is 



I?W = 
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T) (n— 1) xn 



(3) 



^0 ... -1 1 

and so ||Z3^^)/3|ji — X]"=i — \f^i ~ l^i+i\- Hence the 0th order trend filtering problem, which we 
will also call constant trend filtering, is the same as f-dimensional total variation denoising (Rudin 
et al. 1992), or the 1-dimensional fused lasso (Tibshirani et al. 2005) (with pure fusion penalty, i.e., 
without an additional ii penalty on the coefficients themselves). In this case, k = 0, the components 
of the trend filtering estimate form a piecewise-constant structure, with break points corresponding 
to the nonzero entries of D^^^P = (/32 — /3i, . . . /3„ — /3„_i). See Figure [ij for an example. 
For fc > 1, the operator D('^+^^ e Rf"^'')^" is most easily defined recursively, as in 

dC^+i) ^L»(i) .L»(fe). (4) 

[Above, Z?^^-' is the {n — k) x [n— k + 1) version of the 1st order discrete difference operator (Is])]. In 
words, the definition Q says that the {k + l)st order difference operator is built up by evaluating 
differences of differences, a total of fc + 1 times. Therefore, the matrix IjC^+i) can be thought of as 
the discrete analogy to the {k-\- l)st order derivative operator, and the penalty term in ([2]) penalizes 
the discrete (fc + l)st derivative of the vector /? e M", i.e., the changes in the discrete fcth derivative 
of (3. Accordingly, one might expect the components of the fcth order trend filtering estimate to 
exhibit the structure of a piecewise polynomial of order fc — e.g., for 1st order trend filtering, the 
estimate would be piecewise linear, for 2nd order, it would be piecewise quadratic, etc. Figure [T] gives 
empirical evidence towards this claim. Later, in Section [4j we provide a more definitive confirmation 
of this piecewise polynomial structure when we examine a continuous-time representation for trend 
filtering. 
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Figure 1: Simple examples of trend filtering for constant, linear, and quadratic orders (k = 0,1,2, 
respectively), shown from left to right. Although the trend filtering estimates are only defined at the 
discrete inputs Xi = i/n, i = 1, . . .n, we use linear interpolation to extend the estimates over [0, 1] 
for visualization purposes (the default for all figures in this paper). 
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and in general, the nonzero elements in each row of D^'^^ are given by the (k + l)st row of Pascal's 
triangle, but with alternating signs. A more explicit (but also more complicated-looking) expression 
for the kth order trend filtering estimate is therefore 



argmin J ^^C,, - /3,)^ + ^ • A J] ^ (-1)-'^ ( ' + M 



1=1 4=1 



The penalty term above sums over successive linear combinations of fc + 2 adjacent coefficients, i.e., 
the discrete difference operator il)('^+^) is a banded matrix with bandwidth k + 2. 

1.1 The generalized lasso and related properties 

For any order fc > 0, the trend filtering estimate f3 is uniquely defined, because the criterion in (pi) is 
strictly convex. Furthemore, the trend filtering criterion is of generalized lasso form with an identity 
predictor matrix X — I (this is called the signal approximator case) and a specific choice of penalty 
matrix D = D'-'^^^K Some properties of the trend filtering estimate therefore follow from known 
results on the generalized lasso (Tibshirani & Taylor 2011, Tibshirani & Taylor 2012), e.g., an exact 
representation of the trend filtering estimate in terms of its active set and signs, and also, a formula 
for its degrees of freedom: 

df (/3) = E[number of knots in ^] + A; + 1, (5) 

where the number of knots in /3 is interpreted to mean the number of nonzero entries in D^^'^ >(i 
(the basis and continuous-time representations of trend filtering, in Sections 3.3 and pi provide a 



justification for this interpretation). To repeat some of the discussion in Tibshirani & Taylor (2011) 
Tibshirani & Taylor (2012), the result in (l5| may seem somewhat remarkable, as a fixed-knot fcth 
degree regression spline with d knots also has d + k + l degrees of freedom — and trend filtering does 
not employ fixed knots, but rather, chooses them adaptively. So why does trend filtering not have a 
larger degrees of freedom? At a high level, the answer lies in the shrinkage due to the £i penalty in 
([2]): the nonzero entries of D^ '^^' /3 are shrunken towards zero, compared to the same quantity for 
the corresponding equality-constrained least squares estimate. In other words, within each interval 
defined by the (adaptively chosen) knots, trend filtering fits a fcth degree polynomial whose fcth 
derivative is shrunken towards its fcth derivatives in neighboring intervals, when compared to a fcth 
degree regression spline with the same knots. Figure [2] gives a demonstration of this phenomenon 
for fc = 1 and fc = 3. 

In terms of algorithms, the fact that the discrete difference operator D^'^+i) is banded is of great 
advantage for solving the generalized lasso problem in (pi). Kim et al. (2009) describe a primal-dual 
interior point method for solving (pi) at a fixed value of A, wherein each iteration essentially reduces 
to solving a linear system in I?('=Ti)(£)('=+i))^, costing 0{n) operations. In the worst case, this 
algorithm requires 0{n^^^) iterations, so its complexity is 0(n'^/^)pj [Kim et al. (2009) focus mainly 
on linear trend filtering, the case fc = 1, but their arguments carry over to the general case as well.] 
On the other hand, instead of solving (pi) at a fixed A, Tibshirani & Taylor (2011) describe a path 
algorithm to solve ([2| over all values of A S [0, oo), i.e., to compute the entire solution path /3 = /3(A) 
over A. This path is piecewise linear as a function of A [not to be confused with the estimate itself at 
any fixed A, which has a piecewise polynomial structure over the input points xi, . . . Xn\- Again, the 



^It should be noted that hidden in the O(-) notation here is a factor depending on the prespecified error tolerance e, 
namely, a term of the form log(l/e). We emphasize here that the primal-dual interior point method is a different type 
of algorithm than the path algorithm, in the sense that the latter returns an exact solution up to computer precision, 
whereas the former returns an e-suboptimal solution, as measured by the difference in its achieved criterion value and 
the optimal criterion value. Essentially all general purpose convex optimization techniques (that are applicable to the 
trend filtering problem) fall into the same class as the primal-dual interior point method, i.e., they return e-suboptimal 
solutions; only specialized techniques like the path algorithm can deliver exact solutions. 
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Figure 2: Examples of the shrinkage effect for linear trend filtering (k ^ 1, left panel) and for cubic 
trend filtering (k = 3, right panel). In each panel, the solid red line is the trend filtering estimate 
(at a particular value of X), and the dashed blue line is the regression spline estimate of the same 
order and with the same knots, with the vertical lines marking the locations of the knots. The trend 
filtering estimates on the left and right have .shrunken 1st and 3rd derivatives, respectively, compared 
to their regression spline counterparts. 



handedness oi D^'''^^^ is key here for efheient computations, and Tihshirani & Arnold (2013) descrihe 
an implementation of the path algorithm in which computing the estimate at each successive critical 
point in the path requires 0(n) operations. 

Software for both of these algorithms is freely available online. For the primal-dual interior point 
method, see http://stanford.edu/~boyd/ll_tf, which provides Matlab and C implementations 
(these only cover the linear trend filtering case, but can easily extended to the general case); for the 
path algorithm, see the function trendf liter in the R package genlasso, available on the CRAN 
repository. 



1.2 Summary of our results 

Little is known about trend filtering — mainly, the results due to its generalized lasso form, e.g., the 
degrees of freedom result ([5]) discussed in the previous section — and much is unknown. Examining 
the trend filtering fits in Figures [T] and [2] it appears that the estimates not only have the structure of 
piecewise polynomials, they furthermore have the structure of splines: these are piecewise polynomial 
functions that have continuous derivatives of all orders lower than the leading one [i.e., a fcth degree 
spline is a fcth degree piecewise polynomial with continuous 0th through (/c — l)st derivatives at its 
knots]. Figurep^plots an example cubic trend filtering estimate, along with its discrete 1st, 2nd, and 
3rd derivatives (given by multiplication by D^^\ D'^^\ and D'-^\ respectively). Sure enough, the 
lower order discrete derivatives appear "continuous" across the knots, but what does this really mean 
for such discrete sequences? Does trend filtering have an analogous continuous-time representation, 
and if so, are the estimated functions really splines? 

Besides these questions, one may also wonder about the (relative) performance of trend filtering 
estimates, as measured by, say, the squared error in predicting the true function /q, averaged over the 
input points. Empirical examples (like those given in Section l2| show that trend filtering estimates 



Estimate 1st derivative 2nd derivative 3rd derivative 



V 



V 


A 





r\ 



0.0 0.2 0.4 0.6 O.a 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 O.B 1.C 0.0 0.2 0.4 0.6 0.8 U 

Figure 3: The leftmost panel shows the same cubic trend filtering estimate as in FigurelE (but here we 
do not use linear interpolation to emphasize the discrete nature of the estimate). The components of 
this estimate appear to be the evaluations of a continuous piecewise polynomial function. Moreover, 
its discrete 1st and 2nd derivatives (given by multiplying by D^^^ and D'-^' , respectively) also appear 
to be continuous, and its discrete third derivative (from multiplication by D^^> ) is piecewise constant 
within each interval. Hence we might believe that such a trend filtering estimate actually represents 
the evaluations of a 3rd degree spline over the inputs Xi — i/n, i = 1, . . .n. We address this idea in 
Section [^ 



achieve a significantly higher degree of local adaptivity than smoothing splines, which are arguably 
the standard tool for adaptive spline estimation. Other examples (like those in Section Isl) show 
that trend filtering estimates display a comparable level of adaptivity to locally adaptive regression 
splines, another well-known technique for adaptive spline estimation, proposed by Mammen & van de 
Geer (1997) on the basis of being more locally adaptive (as their name suggests). Although examples 
are certainly encouraging, a solely empirical conclusion here would be unsatisfactory — can we say 
more definitively that trend filtering estimates actually outperform smoothing splines, and perform 
as well as locally adaptive regression splines, in terms of squared error loss? 

We investigate the above questions in this paper. To summarize our results, wc find that: 

• for A: = 0, 1 (constant or linear orders), the continuous-time analogs of trend filtering estimates 
are indeed fcth degree splines; moreover, they are exactly the same as fcth order locally adaptive 
regression splines; 

• for k > 2 (quadratic or higher orders) , the continuous-time versions of trend filtering estimates 
are not quite splines, but piecewise polynomial functions that are "close to" splines (with small 
discontinuities in lower order derivatives at the knots); hence they are not the same as fcth 
order locally adaptive regression splines; 

• for any fc, if the fcth derivative of true function /q is of bounded variation, then the fcth order 
trend filtering estimate converges to /q (in terms of squared error loss) at the minimax rate; 
this rate is achieved by locally adaptive regression splines (Mammen & van de Geer 1997), but 
not by smoothing splines nor any other estimate linear in y (Donoho & Johnstone 1998). 

We note that, although trend filtering and locally adaptive regression splines are formally different 
estimators for fc > 2, they are practically indistinguishable by eye in most examples. Such a degree 
of similarity, in finite samples, goes beyond what we are able to show theoretically. However, we do 
prove that trend filtering estimates and locally adaptive regression spline estimates converge to each 
other asymptotically (Theorem Q. The argument here boils down to a bound on the difference in 
the fitted values of two lasso problems that have the same outcome vector, but different predictor 
matrices (because both trend filtering and locally adaptive regression splines can be represented as 



lasso problems, see Section l3|. To the best of our knowledge, this general bound is a new result 
(Appendix [B|). Further, we use this asymptotic pairing between trend filtering and locally adaptive 
regression splines to prove the minimax convergence rate for trend filtering (Corollary [I]). The idea is 
simple: trend filtering and locally adaptive regression splines converge to each other at the minimax 
rate, locally adaptive regression splines converge to the true function at the minimax rate (Mammen 
& van de Geer 1997), and hence so does trend filtering. 

1.3 Why trend filtering? 

Trend filtering estimates, we argue, enjoy the favorable theoretical performance of locally adaptive 
regression splines; but it is now a fair question to ask: why would we ever use trend filtering, over, 
say, the latter estimator? One reason is that trend filtering estimates are much easier to compute, 
due to the handedness of the discrete derivative operators, explained previously. The computations 
for locally adaptive regression splines, meanwhile, cannot exploit such sparsity or structure, and 
are considerably slower. To be more concrete, the primal-dual interior point method described in 
Section [L1| above can handle problems of size on the order of n = 1,000,000 points (and the path 
algorithm, on the order of n = 100,000 points), but even for n = 10,000 points, the computations 
for locally adaptive regression splines are prohibitively slow. We discuss this in Section [3j 

On the other hand, locally adaptive regression splines maintain their minimax convergence rate 
for unevenly spaced input points xi, . . .Xn & [0, 1], provided that the spacings are not too far apart 
(Mammen & van de Geer 1997). Our standing assumption for trend filtering has been that the 
inputs are evenly spaced {xi = i/n, i = 1, . . . n), and we do not prove a rate for trend filtering in the 
unevenly spaced case. However, it is worth noting that our proof technique for evenly spaced inputs 
is general enough that it may well carry over to the unevenly spaced case, given the proper analysis 
of trend filtering and locally adaptive regression spline basis matrices (the predictor matrices in their 
lasso representations). We briefiy discuss this extension in Section [6. 1[ a detailed examination will 
be the topic of future work. 

As a separate point, a distinguishing feature of trend filtering is that it falls into what is called 
the analysis framework with respect to its problem formulation, whereas locally adaptive regression 
splines, smoothing splines, and most others fall into the synthesis framework. Synthesis and analysis 
are two terms used in signal processing that describe different approaches for defining an estimator 
with certain desired characteristics. In the synthesis approach, one builds up the estimate construc- 
tively from a set of characteristic elements or atoms; in the analysis approach, the strategy is instead 
to define the estimate deconstructively, via an operator that penalizes undesirable or uncharacterisic 
behavior. Depending on the situation, it can be more natural to implement the former rather than 
the latter, or vice versa, and hence both are important. We discuss the importance of the analysis 



framework in the context of nonparametric regression estimators in Section 6.2 where we define 
extensions of trend filtering that would be difficult to construct from the synthesis perspective, e.g., 
a sparse variant of trend filtering. 

Here is an outline for the rest of this article (though we have discussed its contents throughout 
the introduction, we list them here in proper order). In Sections^ and p| we compare trend filtering 
to smoothing splines and locally adaptive regression splines, respectively. We give data examples 
that show trend filtering estimates are more locally adaptive than smoothing splines, and that 
trend filtering and locally adaptive regression splines are remarkably similar, at any common value 
of their tuning parameters. We also discuss the differing computational requirements for these 
methods. In Section [3j we show that both locally adaptive regression splines and trend filtering can 
be posed as lasso problems, with identical predictor matrices when fc = or 1, and with similar 
but slightly different predictor matrices when fc > 2. This allows us to conclude that trend filtering 
and locally adaptive regression splines are exactly the same for constant or linear orders, but not 
for quadratic or higher orders. Section |4] develops a continuous-time representation for the trend 
filtering problem, which reveals that (continuous-time) trend filtering estimates are always fcth order 



piecewise polynomials, but for k > 2, are not fcth order splines. In Section [5] we derive the minimax 
convergence rate of trend filtering estimates, under the assumption that the kth derivative of the 
true function has bounded total variation. We do this by bounding the difference between trend 
filtering estimates and locally adaptive regression splines, and invoking the fact that the latter are 
already known to converge at the minimax rate (Mammen & van de Geer 1997). We also study 
convergence rates for a true function with growing total variation. Section [6] contains extensions and 
discussion: unevenly spaced input points, sparse trend filtering, and the synthesis versus analysis 
perspectives. Most proofs are deferred until Appendices [A] [b1 [C] 

2 Comparison to smoothing splines 

Smoothing splines are a popular tool in nonparametric regression, and have been extensively studied 
in terms of both computations and theory [some well-known references are de Boor (1978), Wahba 
(1990), Green & Silverman (1994)]. Given input points xi,...Xn G [0,1]: which we assume are 
unique, and observations j/i, . . . y„ G M, the fcth order smoothing spline estimate is defined as 

n „1 

/= argmin ^ (y, - /(x,))' + A / {f('^\t)f dt, (6) 



/ew, 



(fc + l)/2 



where /*- 2 ) (t) is the derivative of / of order (fc + 1)/2, A > is a tuning parameter, and the domain 
of minimization here is the Sobolev space 

W(fc+i)/2 = I / : [0, 1] ^ M : / is (fc + l)/2 times differentiable, and / (/'^H*))^ dt < 00 

Unlike trend filtering, smoothing splines are only defined for an odd polynomial order fc. In practice, 
it seems that the case fc = 3 (i.e., cubic smoothing splines) is by far the most common case considered. 
In the next section, we draw a high-level comparison between smoothing splines and trend filtering, 
by writing the smoothing spline minimization problem ([6]) in finite-dimensional form. Following this, 
we make empirical comparisons, and then discuss computational efficiency. 

2.1 Generalized ridge regression and Reinsch form 

Remarkably, it can be shown that the infinite-dimensional problem in ([6]) is has a unique minimizer, 
which is a fcth degree natural spline with knots at the input points xi, . . . Xn [see, e.g., Wahba (1990), 
Green & Silverman (1994), Hastie et al. (2008)]. Recall that a fcth degree natural spline is a simply 
a fcth degree spline that reduces to a polynomial of degree (fc — l)/2 before the first knot and after 
the last knot; it is easy to check the set of natural splines of degree fc, with knots at xi, . . . x„, is 
spanned by precisely n basis functions. Hence to solve ([6]), we can solve for the coefficients 9 € M" 
in this basis expansion: 

^ = argmin ||y-7V6i||^ + A6''^06', (7) 

where N E M"^" contains the evaluations of fcth degree natural spline basis functions over the knots 
Xi, . . .Xn, and 51 G M"'^" contains the integrated products of their ((fc -I- l)/2)nd derivatives; i.e., if 
rji, . . .rjn denotes a collection of basis functions for the set of fcth degrees natural splines with knots 
at xi, . . .Xn, then 

N,j = rij{xi) and n,^ = / T]l^\t) ■ v'i'^\t) dt for all i,j = l,...n. (8) 



The problem in (I?]) is a generalized ridge regression, and from its solution 9, the function / in (l6]) is 
simply given at the input points xi, . . . x„ by 

(/(xi),.../(x„)) = m 

More generally, the smoothing spline estimate / at an arbitrary input x G [0, 1] is given by 

n 

In order to compare the smoothing spline problem, as written in Hh, with trend filtering, it helps 
to rewrite the smoothing spline fitted values as follows: 

N6 = n{n'^n + xny^N'^y 

= {I + XK)-'y, (9) 

where K = N~'^V,N~^ . The expression in (l9| is called the Reinsch form for the fitted values. From 
this expression, we can u = NO as the solution of the minimization problem 

u = sigvaiii \\y — u\\\ + \u^ Ku, (10) 

which is of similar form to the trend filtering problem in ([2]), but here the £i penalty ||D'''°"''^'/3||i is 
replaced by the quadratic penalty u^Ku — |ji^^/^M|||. How do these two penalties compare? First, 
the penalty matrix K^'^ used by smoothing splines is similar in nature to the discrete derivative 
operators [we know from its continuous-time analog in m that the term ||ii''^'^u||| penalizes wiggli- 
ness in something like the ((fc + l)/2)nd derivative of u] but is still strictly different. For example, 
for A: = 3 (cubic smoothing splines) and input points Xi — i/n, i = 1, . . . n, it can be shown (Green 
& YandeU 1985) that the smoothing spline penalty is \\K^^^u\\l = \\C-'^^^D^^'>u\\l/n^ where D^^'> is 
the second order discrete derivative operator, and C G M"^" is a tridiagonal matrix, with diagonal 
elements equal to 2/3 and off-diagonal elements equal to 1/6. 

A second and more important difference is that smoothing splines utilize a (squared) £2 penalty, 
while trend filtering uses an ii penalty. Analogous to the usual comparisons between ridge regression 
and the lasso, the former penalty shrinks the components of K^^'^u, but does not set any of the 
components to zero unless A = 00 (in which case all components are zero) , whereas the latter penalty 
shrinks and also adaptively sets components of £)('^+^'/3 to zero. One might imagine, recalling that 
K^'"^ and D^'^^^' both act in a sense as derivative operators, that trend filtering estimates therefore 
exhibit a finer degree of local adaptivity than do smoothing splines. This idea is supported by the 
examples in the next section, which show that trend filtering estimates outperform smoothing splines 
(when both are optimally tuned) in estimating functions with spatially inhomogenous smoothness. 
The idea is also supported by our theory in Section [5J where we prove that trend filtering estimates 
have a better rate of convergence than smoothing splines (in fact, they achieve the optimal rate) 
over a broad class of underlying functions. 

2.2 Empirical comparisons 

We compare trend filtering and smoothing spline estimates on simulated data. We fix fc = 3 (i.e., 
we compare cubic trend filtering vesus cubic smoothness splines), because the smooth. spline func- 
tion in the R programming language provides a fast implementation for smoothing splines in this 
case. Generally speaking, smoothing splines and trend filtering provide similar estimates when the 



underlying function /o has spatially homogeneous smoothness, or to put it simply, is either entirely 
smooth or entirely wiggly throughout its domain. Hence, to illustrate the difference between the 
two estimators, we consider two examples of functions that display varying levels of smoothness at 
different spatial locations. 

Our first example, which wc call the "hills" example, considers a piecewise cubic function /o over 
[0, 1], whose knots are spaced farther apart on the left side of the domain, but bunched closer together 
on the right side. As a result, fo{x) is smooth for x between and about 0.8, but then abruptly 
becomes more wiggly — see the top left panel of Figure |4] We drew n = 128 noisy observations from 
/o over the evenly spaced inputs Xi — i/n, i = 1, . . .n (with independent, normal noise), and fit a 
trend filtering estimate, tuned to have 19 degrees of freedom, as shown in the top right paneljj We 
can see here that the estimate adapts to the appropriate levels of smoothness at both the left and 
right sides of the domain. But this is not true of the smoothing spline estimate with 19 degrees of 
freedom, displayed in the bottom left panel: the estimate is considerably oversmoothed on the right 
side. As we increase the allowed fiexibility, the smoothing spline estimate is able to fit the small 
hills on the right, with a total of 30 degrees of freedom; however, this causes undersmoothing on the 
left side, as shown in the bottom right panel. 

For our second example, we take /o to be the "Doppler" function [as considered in, e.g., Donoho 
& Johnstone (1995), Mammen & van de Geer (1997)]. Figurelsj clockwise from the top left, displays 
the Doppler function and corresponding n — 1000 noisy observations, the trend filtering estimate 
with 50 degrees of freedom, the smoothing spline estimate with 50 degrees of freedom, and the 
smoothing spline estimate with 90 degrees of freedom. The same story, as in the hills example, holds 
here: trend filtering adapts to the local level of smoothness better than smoothing splines, which 
have trouble with the rapidly increasing frequency of the Doppler function (as x decreases). 

In figure k\ we display the average squared error lossea^ 



-Y.i^^-J^o{x^)y and -J2{f{x,)-f,{x.,)y 

i=l i=l 



for the trend filtering and smoothing spline estimates /3 and /, respectively, on the hills and Doppler 
examples. We considered a wide range of model complexities indexed by degrees of freedom, and 
averaged the results over 50 simulated data sets for each setup (the dotted lines show plus or minus 
one standard deviations). Aside from the visual evidence given in Figures HI and pj Figure p^ shows 
that from the perspective of squared error loss, trend filtering outperforms smoothing splines in 
estimating underlying functions with variable spatial smoothness. As mentioned previously, we will 
prove in Section [5] that for a large class of underlying functions /g, trend filtering estimates have a 
sharper convergence rate than smoothing splines. 

2.3 Computational considerations 

Recall that the smoothing spline fitted values are given by 

N6 = N{N'^N + \n)-^N^y, (11) 

where N e M"^" contains the evaluations of basis functions rji, . . .rjn for the subspace of fcth degree 
natural splines with knots at the inputs, and fl G M"^" contains their integrated products of their 
((fc + l)/2)nd order derivatives, as in (18]). Depending on exactly which basis we choose, computation 



of (11) can be fast or slow; by choosing the B-spline basis functions, which have local support, the 



^To be precise, this is an unbiased estimate of its degrees of freedom; see ^ in Section 1.1 

^For the Doppler data example, we actually average the squared error loss only over inputs Xi > 0.175, because for 

Xi < 0.175, the true Doppler function /o is of such high frequency that neither trend filtering nor smoothing splines 

arc able to do a decent job of fitting it. 



True function 



Trend filtering, df=19 





Smoothing spline, df=19 




Smoothing spline, df=30 




Figure 4: An example with n — 128 observations drawn from a model where the underlying function 
has variable spatial smoothness, as shown in the top left panel. The cubic trend filtering estimate with 
19 degrees of freedom, shown in the top right panel, picks up the appropriate level of smoothness at 
different spatial locations: smooth at the left side of the domain, and wiggly at the right side. When 
also allowed 19 degrees of freedom, the cubic smoothing spline estimate in the bottom left panel 
grossly underestimates the signal on the right side of the domain. The bottom right panel shows the 
smooth spline estimate with 30 degrees of freedom, tuned so that it displays the appropriate level of 
adaptivity on the right side; but now, it is overly adaptive on the left side. 
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True function 



Trend filtering, df=50 




~i \ 1 1 — 

0.0 0.2 0.4 0.6 



1.0 




Smoothing spline, df=50 




Smoothing spline, df=90 




Figure 5: An example with n — 1000 noisy observations of the Doppler function, f{x) = sin(4/a;) + 
1.5, drawn in the top left panel. The top right and bottom left panels show the cubic trend filtering 
and smoothing spline estimates, each with 50 degrees of freedom; the former captures approximately 
4 cycles of the Doppler function, and the latter only 3. If we nearly double the model complexity, 
namely, we use 90 degrees of freedom, then the smoothing spline estimate is finally able to capture 4 
cycles, but the estimate now becomes very jagged on the right side of the plot. 
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Hills example 



Doppler example 





Degrees of freedom 



Degrees of freedom 



Figure 6: Shown is the squared error loss in predicting the true function /g, averaged over the input 
points, for the hills data example on the left, and the Doppler example on the right. In each setup, 
trend filtering and smoothing spline estimators were fit over a range of degrees of freedom values; the 
red curves display the loss for trend filtering, and the blue curves for smoothing splines. The results 
were averaged over 50 simulated data sets, and the standard deviations are denoted by dotted lines. 
In these examples, trend filtering has a generally better predictive accuracy than smoothing splines, 
especially for models of low to intermediate complexity (degrees of freedom). 



matrix N'^ N + Xfl is banded, and so the smoothing sphne fitted values can be computed in 0{n) 
operations [e.g, see de Boor (1978)]. In practice, these computations are extremely fast. 

By comparison, Kim et al. (2009) suggest a primal-dual interior point method, as mentioned in 
Section |1.1[ that computes the trend filtering estimate (at any fixed value of the tuning parameter 
A) by iteratively solving a sequence of banded linear systems, rather than just a single one. Theoret- 
ically, the worst-case number of iterations scales as 0{n^'^), but the authors report that in practice 
the number of iterations needed is only a few tens, almost independent of the problem size n. Hence 
trend filtering computations with the primal-dual path interior point method are slower than those 
for smoothing splines, but not by a huge margin. 

To compute the trend filtering estimates for the examples in the previous section, we actually 
used the dual path algorithm of Tibshirani & Taylor (2011), which was also discussed in Section 
|1.1[ Instead of solving the trend filtering problem at a fixed value of A, this algorithm constructs 
the solution path as A varies from oo to 0. Essentially, it does so by stepping through a sequence 
of estimates, where each step either adds one knot to or deletes one knot from the fitted piecewise 
polynomial structure. The computations at each step amount to solving two banded linear systems, 
and hence require 0{n) operations; the overall efficiency depends on how many steps along the path 
are needed before the estimates of interest have been reached (at which point the path algorithm 
can be terminated early). But because knots can be both added and deleted to the fitted piecewise 
polynomial structure at each step, the algorithm can take much more than k steps to reach an 
estimate with k knots. Consider the Doppler data example, in the last section, with n — 1000 
points: the path algorithm used nearly 4000 steps to compute the trend filtering estimate with 46 
knots (50 degrees of freedom) shown in the upper right panel of Figure^ This took approximately 28 
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seconds on a standard desktop computer, compared to the smoothing sphne estimates shown in the 
bottom left and right panels of Figure [5] which took about 0.005 seconds each. We reiterate that in 
this period of time, the path algorithm for trend filtering computed a total of 4000 estimates, versus 
a single estimate computed by the smoothing sphne solver. (A quick calculation, 28/4000 — 0.007, 
shows that the time per estimate here is comparable.) For the hills data set in the last section, where 
n — 128, the dual path algorithm constructed the entire path of trend filtering estimates (consisting 
of 548 steps) in less than 3 seconds; both smoothing spline estimates took under 0.005 seconds each. 

3 Comparison to locally adaptive regression splines 

Locally adaptive regression splines are an alternative to smoothing splines, proposed by Mammen 
& van de Geer (1997). They are more computationally intensive than smoothing splines but have 
better adaptivity properties (as their name would suggest). Let xi, . . .x„ € [0, 1] denote the inputs, 
assumed unique and ordered as in xi < 0:2 < ■ • ■ < a^n, and 2/1, . . . ?/„ S K denote the observations. 
For the fcth order locally adaptive regression spline estimate, where fc > is a given arbitrary integer 
(not necessarily odd, as is required for smoothing splines), we start by defining the knot superset 

j,^ f{2;fc/2+2,---a;n-fc/2} if fc is even, 

[{x(fc+i)/2+i,...a;„_(fe+i)/2} if fc is odd. 

This is essentially just the set of inputs \x\^ . . . a;„}, but with points near the left and right boundaries 
removed. We then define the fcth order locally adaptive regression spline estimate as 

n 

f = argmin - ^ (y, - /(x.))' + A • TV(/('=)), (13) 

where f^''^ is now the kth weak derivative of /, A > is a tuning parameter, TV(-) denotes the total 
variation operator, and Gk is the set 

5fe = {/ : [0, 1] -^ IR : /is fcth degree spline with knots contained in T}. (14) 

Recall that for a function / : [0, 1] -^ M, its total variation is defined as 

TV(/) = sup <^ ^ 1/(2:^+1) - fizt)\ : zi <...< Zp is a partition of [0, 1] L 

and this reduces to TV(/) = /g I/' (01 '^^ if / is (strongly) differentiable. 

Next, we briefly address the difference between our definition of locally adaptive regression splines 



in (13) and the original definition found in Mammen & van de Geer (1997); this discussion can be 



skipped without interrupting the fiow of ideas. After this, we rewrite problem ( 13 ) in terms of the 
coefficients of / with respect to a basis for the finite-dimensional set Gk- For an arbitrary choice of 
basis, this new problem is of generalized lasso form, and in particular, if we choose the truncated 
power series as our basis for Gk-, it simply becomes a lasso problem. We will see that trend filtering, 
too, can be represented as a lasso problem, which allows for a more direct comparison between the 
two estimators. 

3.1 Unrestricted locally adaptive regression splines 

For readers familiar with the work of Mammen & van de Geer (1997), it may be helpful to explain 
the difference between our definition of locally adaptive regression splines and theirs: these authors 
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define the locally adaptive regression spline estimate as 



/ e argmin - 



E 



(y.-/(a;.)) +A-TV(/(''^)), 



(15) 



where J-h is the set 



H 



|/ : [0, 1] ^ M : / is fc times weakly differentiable and TV(/(''^) < ooj 



[The element notation in ( 15 ) emphasizes the fact that the minimizer is not generally unique.] We 
call (15) the unrestricted \oc&\\y adaptive regression spline problem, in reference to its minimization 
domain compared to that of (13). Mammen & van de Geer (1997) prove that the minimum in 



this unrestricted problem is always achieved by a fcth degree spline, and that this spline has knots 
contained in T if fc = or 1, but could have knots outside of T (and in fact, outside of the input set 
{xi , . . . ccn}) if fc > 2. In other words, the solution in ( 13 ) is always a solution in ( 15 ) when fc = or 
1, but this need not be true when fc > 2; in the latter case, even though there exists a fcth degree 
spline that minimizes (15), its knots could occur at non-input points. 



The unrestricted locally adaptive regression estimate ( 15 ) is the main object of theoretical study 
in Mammen & van de Geer (1997), but practically speaking, this estimate is difficult to compute 
when fc > 2, because the possible knot locations are generally not easy to determine [see also Rosset 
& Zhu (2007)]. On the other hand, the restricted estimate as defined in ( 13 ) is more computationally 



tractable. Fortunately, Mammen & van de Geer (1997) show that essentially all of their theoretical 
results for the unrestricted estimate also apply to the restricted estimate, as long as the input points 
xi, . . .Xn are not spaced too far apart. In particular, for evenly spaced inputs, Xi = i/n, z = 1, . . . n, 
the convergence rates of the unrestricted and restricted estimates are the same. We therefore focus 



on the restricted problem ( 13 ) in the current paper, and mention the unrestricted problem ( 15 ) 
purely out of interest. For example, to anticipate results to come, we will show in Lemma [3] of 
Section 3.3 that the trend filtering estimate (pi) for fc = or 1 is equal to the locally adaptive 
regression spline estimate (13) (i.e., they match at the input points xi, . . .a;„); hence, from what we 



discussed above, it also solves the unrestricted problem in ( 15 ) 



3.2 Generalized lasso and lasso form 



We note that the knot set T in ( 12 ) has n — k — 1 elements, so Qk is spanned by n basis functions. 



call them gi, . . . g^. Since each gj, j = 1, . . .n \s & fcth degree spline with knots in T, we know that 
its fcth weak derivative is piecewise constant and (say) right-continuous, with jump points contained 
in T; therefore, writing to = and T — {ti, . . . i„-/c-i}, we have 



i-fc-i 



TV(g, 



E kf'(*o- 5^(^.-1)1. 



i=l 



Similarly, any linear combination of gi , . . . (?„ satisfies 

-k-l 

-j9j 



i=l 






i=i 



Hence we can reexpress ( 13 ) in terms of the coefficients 6 G M" in its basis expansion with respect 
to gi,. ..gn, 

,2 , Mi^^n (16) 



1, 



= argmin -\\y-Ge\\i + X\\Ce\\u 
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where G E K"^" contains the evaluations of 51, ... g„ over the inputs xi, . 
contains the differences in their kth derivatives across the knots, i.e., 



^■y — 9j{^i) 



Ci. 



-(fe 



5r(io-5r(t.-i) 



ik). 



for i,j = l,...n, 
for i = 1 , . . . n — fc 



1, 3 



Xn 


, and C 


eM("" 


-k- 


-l)xn 

(17) 


= 


l,...n. 






(18) 



Given the solution d in (|16|), wc can recover the locally adaptive regression spline estimate / in (13) 

{f{x,),...f{x^))^G9, 



over the input points by 

or, at an arbitrary point x G [0, 1] by 



fix) 



J2^j9jix)- 



The problem ( 16 ) is a generalized lasso problem, with predictor matrix G and penalty matrix C; by 



taking gi, . . .gn to be the truncated power basis, we can turn (a block of) C into the identity, and 
hence ( 16 ) into a lasso problem. 



Lemma 1. Let T ~ {ti, . . .tn-k-i} denote the set defined in (12), and let gi, ...gn denote the kth 
order truncated power basis with knots in T , 



gi{x) = 1, .92(2;) ^x, ... gk+i(x) = x*^, 
gk+i+j{x) = {x- tjf ■ l{x > tj}, j = l,...n- 



(19) 



(For the case k — 0, we interpret 0° = l.J Then the locally adaptive regression spline problem (13) 
is equivalent to the following lasso problem: 



§ = aigmin -\\y - GeWi + \ 



3 = k+2 



(20) 



in that f(x) — Yll=i ^j9jix) for x G [0, 1]. Here G G 



is the basis matrix as in (17). 



Proof. This follows from the fact that for the truncated power basis, the penalty matrix C in (18) 
satisfies Gi^i-^k+i = 1 for i = l,...n — A: — 1, and Gij = otherwise. D 



It is worth noting that Osborne et al. (1998) investigate a lasso problem similar to (20) for the 
purposes of knot selection in regression splines. Note that (20) is of somewhat nonstandard form for 
a lasso problem, because the ii penalty is only taken over the last n — k — 1 components of 6. We 
will see next that the trend filtering problem in ^ can also be written in lasso form (again with the 
£1 penalty summing over the last n — k—1 coefficients), and we will compare these two formulations. 



First, however, it is helpful to express the knot superset T in (12 1 and the basis matrix G in (17) in 
a more explicit form, for evenly spaced input points Xi — i/n, i — 1, . . . n (this being the underlying 
assumption for trend filtering). These become: 



T = 



((fc + 2)/2 + ?;)/n fori = l,. 


. . n 


{{k + \)/2 + i)/n fori = l,. 


. . n 



A: — 1, if fc is even, 
k — 1, if fc is odd, 



(21) 
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and 



,-j-i/„j-i 



Gij — < 



for i < j, 
for i > j, 

for i = l,...n, j = 1, . . . k + 1, 
for i<j- fc/2, j >k + 2, 
for i > j - fc/2, j > fc + 2, 

for i = l,...n, j = l,...fc + l, 
for i < J - (fc + l)/2, j > fc + 2, 

(i - j + (fc + l)/2)''/n'= for i > j - (fc + l)/2, j > fc + 2. 



/7lJ 



(i - J + fc/2) V"'' 



,-J-l/riJ-l 



/nJ 



if fc = 0, 



if fc > is even, 



if fc > is odd. 



(22) 



(It is not really important to separate the definition of G for fc = from that for fc > 0, fc even; this 
is only done to make transparent the structure of G.) 

3.3 Trend filtering in lasso form 

We can transform the trend filtering problem in (pi) into lasso form, just like the representation for 
locally adaptive regression splines in (|20|). 



(23) 



Lemma 2. The trend filtering problem in ^ is equivalent to the lasso proble 



1 



a e argmin -\\y - Ha\\l + A ^ 



in that the solutions satisfy /3 — Ha. Here, the predictor matrix H G 



Hij - < 



for i = 1, . . . n, j — 1, . . . k 
fori<j~l, j>k + 2, 

[^t]+i ■ kl/n'' for z > J - 1, J > fc + 2, 



is given by 
1, 



(24) 



where we define a. 



(0) 



1 for all i and 

afU^af-^) /orfc = l,2,3, 



j=i 



^k) 



i.e., (j\ is the kth order cumulative sum of (1, 1, ... 1) G M*. 

The proof of Lemma b^ essentially inverts the (fc + l)st order discrete difference operator ZjC^+i); 
it is not difficult, but requires some calculation, so we defer it until Appendix] A. 1[ 



It is not hard to check that in the case fc = or 1, the definitions of G in (|22|) and H in (24) 



coincide, which means that the locally adaptive regression spline and trend filtering problems (20) 



and (23 1 are the same. But when fc > 2, we have G ^ H, and hence the problems are different. 



Lemma 3. Consider evenly spaced inputs Xi — i/n, i = 1, . . .n, and the basis matrices G, H defined 



in (22), (24). If k = or 1, then G — H, so the lasso representations for locally adaptive regression 



splines and trend filtering, ( 20 ) and ( 23 ) , are the same. Therefore their solutions are the same, or 
in other words, 

Pi^f{xt) for i = l,...n, 
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where (3 and f are the solutions of the original trend filtering and locally adaptive regression spline 
problems, ([2]) and (13), at any fixed common value of the tuning parameter X. 

If k>2, however, then G ^ H, so the problems (20 1 and (23) are different, and hence the trend 



filtering and locally adpative regression spline estimators are generically different. 



Proof. By inspection of ( 22 ) and ( 24 ) , we have 



G = H 



1 
1 1 

1 1 



if /c = 0. G = H 
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1 
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1 
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1 


.. 



n n n 



1 



if fc = 1, 



and G ^ H ii k >2 [in this case G and H differ by more than a scalar factor, so problems (20) and 
( 23 ) cannot be equated by simply modifying the tuning parameters] . D 



Though the trend filtering and locally adaptive regression spline estimates are formally different 
for polynomial orders k > 2, they are practically very similar (at all common values of A). We give 
examples of this next, and then compare the computational requirements for the two methods. 



3.4 Empirical comparisons 

We revisit the hills and Doppler examples of Section 2.2 Figure [t] displays, for A; = 3 (cubic order), 
the trend filtering and locally adaptive regression spline estimates at matching values of the tuning 
parameter A. The estimates are visually identical in both examples (but not numerically identical — 
the average squared difference between the estimates across the input points is around 10~^ for the 
hills example, and 10~^ for the Doppler example). This remains true for a wide range of common 
tuning parameter values, and only for very small values of A do slight differences between the two 
estimators become noticeable. 

Nothing is special about the choice fc = 3 here or about these data sets in particular: as far as 
we can tell, the same phenomenon occurs for any polynomial order fc, and any set of observations. 
This extreme similarity between the two estimators, holding in finite sample and across essentially all 
common tuning parameter values, is beyond what we show theoretically in SectionlSJ In this section, 
we prove that for tuning parameters of a certain order, the two estimators converge asymptotically 
at a fast rate. Sharper statements are a topic for future work. 



3.5 Computational considerations 

Both the locally adaptive regression spline and trend filtering problems can be represented as lasso 
problems with dense, square predictor matrices, as in (20) and (23). For trend filtering, we do this 



purely for analytical reasons, and computationally it is much more efficient to work from its original 
representation inj2l, where the penalty operator dC'+i) is sparse and banded. As discussed in 
Sections |1.1| and |2.3[ two efficient algorithms for trend filtering are the primal-dual interior point 
method of Kim et al. (2009) and the dual path algorithm of Tibshirani & Taylor (2011); the former 
computes the trend filtering estimate at a fixed value of A, in 0{n^''^) worst-case complexity [the 
authors claim that the practical complexity is closer to 0{n)]; the latter computes the entire solution 
path over A, with each critical point in this piecewise linear path requiring 0{n) operations. 

For locally adaptive regression splines, on the other hand, there is not a better computational 



alternative than solving the lasso problem in (20). One can check that the inverse of the truncated 



power basis matrix G is dense, so if we converted (20) to generalized lasso form [to match the form 
of trend filtering in ([2])], then it would have a dense penalty matrix. And if we were to choose, e.g.. 
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Figure 7: Trend filtering and locally adaptive regression spline estimates, using the same values of 
the tuning parameter A, for the hills and Doppler data examples considered in Section \2.S\ The trend 
filtering estimates are drawn as solid red lines, and locally adaptive regression splines as dotted blue 
lines; in both examples, the two estimates are basically indistinguishable by eye. 



the B-spline basis over the truncated power basis to parametrize Gk{T) in (13), then ahhough the 
basis matrix G would be sparse and banded, the resulting penalty matrix C in ( 16 ) would be dense. 



In other words, to compute the locally adaptive regression spline estimate, we are more or less stuck 
with solving the lasso problem in (20), where G is the dense predictor matrix in (22 1. This task is 



manageable for small or moderately sized problems, but for large problems, dealing with the n x n 
dense matrix G, and even holding it in memory, becomes burdensome. 

To compute the locally adaptive regression spline estimates for the examples in the last section. 



we solved the lasso problem in ( 20 ) using the LARS algorithm for the lasso path (Efron et al. 2004) , 
as implemented by the lars R package|j This particular algorithm was chosen for the sake of a 
fair comparison to the dual path algorithm used for trend filtering. For the Doppler data example 
with n — 1000 points, the LARS algorithm computed the locally adaptive regression spline estimate 
(shown in the right panel of Figure nh in a comparable amount of time to that taken by the dual 
path algorithm for trend filtering — in fact, it was faster, at approximately 16 versus 28 seconds on 
a standard desktop computer. The real issue, however, is scalability. For n = 1000 points, each of 
these algorithms required about 4000 steps to compute their respective estimates; for n — 10, 000 
noisy observations from the Doppler curve, the dual path algorithm completed 4000 steps in a little 
under 2.5 minutes, whereas the LARS algorithm completed 4000 steps in 1 hour. Furthermore, for 
problem sizes n somewhat larger than n = 10, 000, just fitting the n x n basis matrix G used by the 
LARS algorithm into memory becomes an issue. 



4 Continuous-time representation 



Section 3.3 showed that the trend filtering minimization problem (pi) can be expressed in lasso form 
(23), with a predictor matrix H as in (24). The question we consider is now: is there a set of basis 



*To fit the problem in pOl into standard lasso form, i.e., a form in which the £i penalty is taken over the entire 
coefficient vector, we can solve directly for the first A; + 1 coefficients (in terms of the last n—k — 1 coefficients) , simply 
by linear regression. 
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functions whose evaluations over the inputs xi, . . . a;„ give this matrix HI Our next lemma answers 
this question affirmatively. 

Lemma 4. Given inputs xi < . . . < x„, consider the functions /ii, . . . /i„ defined as 



hi{x) = 1, h2{x) = X, ... hk+i{x) = X ', 
k 
hk+i+j(x) = ]^(a;-a;j+£) • l{a; > Xj+i}, j ^l,...n- 



(25) 



f=i 
// the input points are evenly spaced over [0, 1], Xi ~ i/n for i = 1, . . .n, then the trend filtering basis 



matrix H in (24) is generated by evaluating the functions hi, . . . hn over xi, . . . Xn, i-C., 

Hij=hj{xi), i,j = l,...n. (26) 



The proof is given in Appendix] A. 2 As a result of the lemma, we can alternatively express the 



trend filtering basis matrix iJ in ( 24 ) as 



{i^ ^ /n^ ^ ior i = 1, . . .n, j — 1, . . .k + 1, 

for i<j-l, j>fc + 2, (27) 

nLi(* -{j-k-l + £))M for z > J - 1, j>k + 2. 

This is a helpful form for bounding the difference between the entries of G and H, which is needed 



for our convergence analysis in the next section. Moreover, the functions defined in (25) give rise to 
a natural continuous-time parametrization for trend filtering. 



Lemma 5. For inputs xi < . . . < Xn, and the functions hi, . . . /i„ in (25), define the linear subspace 
of functions 

Hk = span{ft,i, . . . hn} = < \J ctjTij : ai, . . .a„ e M >. (28) 

l j=i J 

If the inputs are evenly spaced, x^ = i/n, i = 1, . . .n, then the continuous-time minimization problem 

n 

f = argmin - J] (y. - f{x,)f + A • TV(/(^-)) (29) 

(where as before, f^^' is understood to mean the kth weak derivative of f) is equivalent to the trend 
filtering problem in ([2]), i.e., their solutions match at the inputs points, 

h = I{xi) for i^l,...n. 
Proof. Expressing / in terms of the finite-dimensional parametrization / = '^^^iCtjhj transforms 



( 29 ) into the lasso problem ( |23[ ) , with H the basis matrix as in ( 26 ) ; this is then equivalent to the 
trend filtering problem in (l2| by Lemmas [4] and [2] D 

Lemma pi says that the components of trend filtering estimate, /3i,.../3„, can be seen as the 
evaluations of a function / g Jik over the input points, where / solves the continuous-time problem 



(29). The function / is a piecewise polynomial of degree k, with knots contained in {xk+2, ■ ■ ■ Xn}, 
and for fc > 1, it is continuous since each of the basis functions hi, . . .hn are continuous. Hence for 
A: = or 1, the continuous-time trend filtering estimate / is a spline (and further, it is equal to the 
locally adaptive regression spline estimate by Lemmapl). But / is not necessarily a spline when fc > 2, 
because in this case it can have discontinuities in its lower order derivatives (of orders 1, . . . fc— 1) at 
the input points. This is because each basis function hj, j = k + 2, . . .n, though infinitely (strongly) 
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Trend filtering basis 



Zoomed in TF basis 





Figure 8: For n — 22 inputs (evenly spaced over [0, 1]) and k — 3, the left panel shows the truncated 
power basis functions in (19 1 and the center panel shows the basis functions in (25) utilized by trend 



filtering. The two sets of basis functions appear very similar. The right plot is a zoomed in version 
of the center plot, and shows the nonsmooth nature of the trend filtering basis functions — here (for 
k = 3) they have discontinuous first and second derivatives. 



differentiable in between the inputs, has discontinuous derivatives of ah lower orders 1, ... fc— 1 at the 
input point a^j-i- These discontinuities are visually quite small in magnitude, and the basis functions 
hi, . . .hn look extremely similar to the truncated power basis functions (71 , . . . g„ ; see Figure |8] for 
an example. 

Loosely speaking, the basis functions hi, . . .hn in ( 25 ) can be thought of as the falling factorial 
analogs of the truncated power basis 51, ...(/„ in ( 19 1. One might expect then, that the subspaces of 



kth degree piecewise polynomial functions Hk and Q^. are fairly close, and that the (continuous-time) 
trend filtering and locally adaptive regression spline problems (29) and (13) admit similar solutions. 



In the next section, we prove that (asymptotically) this is indeed the case, though we do so by 
instead studying the discrete-time parametrizations of these problems. We show that over a broad 
class of true functions /o, trend filtering estimates inherit the minimax convergence rate of locally 
adaptive regression splines, because the two estimators converge to each other at this same rate. 



5 Rates of convergence 



In this section we assume that the observations are drawn from the model 

Vi ^ foixi) + e.i, i = l,...n, 



(30) 



where x^ = i/n, i = 1, . . . n are evenly spaced input points, /g : [0, 1] — > M is an unknown regression 
function to be estimated, and e^, i = 1, . . . n are i.i.d. sub-Gaussian errors with zero mean, i.e.. 



E[ei]=0, P(|e,| > t) < Afexp(-fV(2c^^)) for aU t > 0, i = 1, . 



(31) 



for some constants M, a > 0. 

In Mammen & van de Geer (1997), the authors consider the same setup, and study the perfor- 
mance of the locally adaptive regression spline estimate ( 13 ) when the true function /p belongs to 
the set 

jrfc(C) = |/ : [0, 1] ^ M : f is k times weakly differentiable and TV(/('=)) < c\ , 



20 



for some order fc > and constant C > 0. Theorem 10 of Mammen & van de Geer (1997) shows that 
the fcth order locally adaptive regression spline estimate / in (13), with A = Q{n^^^'^'''^^'>), satisfies 



(/>.)- /o(:^.))'-Op(n-(^'=+^)/(^^+^)), 



(32) 



and also that TV(/) == Op(l). 

5.1 Minimax convergence rate 

As an aside, we note that the rate 7i-(2's+2)/(2fc+3) j^^ |32| is the minimax rate for estimation over 



the function class Tk{C), provided that C > 1. To see this, define the Sobolev smoothness class 



Wk{C) = If : [0,1] 



f is k times difi'erentiable and / {f^^\t))^ dt < C 

Jo 



Minimax rates over the Sobolev classes are well-studied, and it is known [e.g., see Nussbaum (1985)] 
that 



min max E 

/ /oeWfc(C) 



1 . 



n{n 



-2k/{2k+l) 



)• 



Recalling that TV(g) = /^ \f'it)\dt for differentiable /, it follows that J"fc(C) D Wk+i(C - 1), and 



min max E 



1 " 



n{n 



-(2fe+2)/(2fc+3) 



In words, one cannot hope to do better than j7,-(2'i;+2)/(2/c+3) j^j. fi^nction estimation over Tk{C). 

On the other hand, the work of Donoho & Johnstone (1998) provides a lower bound on the rate 
of convergence over TkiC) for any estimate linear in y — by this, we mean that the vector of its fitted 
values over the inputs is a linear function of y. This covers smoothing splines [recall the expression 
( 11 1 for the smoothing splines fitted values] and also, e.g., kernel regression estimators. Letting B" 
denote the three parameter Besov space as in Donoho & Johnstone (1998), and || • ||bq denote the 
correpsonding norm, we have 



^k{C)^\f 

- 


[0, 1] ^ M 


ll/('^^lloo+TV(/ 


-{^ 


[0, 1] ^ M 


II/'''IIbi, <c"} 


-{^ 


[0, 1] ^ M 


\\f\\Biy<C"}, 



(33) 

where we write ||/||oo = ma-Xtgro,!] 1/(^)1 for the Loo function norm, and C ,C" are constants. The 
second containment above follows from a well-known embedding of function spaces [e.g., see Mallat 
(2008), Johnstone (2011)]. The third containment is given by applying the Johnen-Scherer bound 
on the modulus of continuity [e.g.. Theorem 3.1 of DeVore & Lorentz (1993)] when working with 
the usual definition of the Besov normsP] Since the minimax linear risk over the Besov ball in (33) 
is of order n-(2fe+i)/(2fc+2) ponoho & Johnstone 1998), we havcj^ 



min max E 

/ linear fo£^k{C) 



1 " 



= f](n-(2fc+i)/(2fe+2))^ 



Hence, in terms of their convergence rate over J^fc(C), smoothing splines are suboptimal. 

^Thanks to Iain Johnstone for pointing this out. 

®These authors actually study minimax rates under the L2 function norm, instead of the discrete (input-averaged) 
norm that we consider here. However, these two norms are close enough over the Besov spaces that the rates do not 
change; see Section 15.5 of Johnstone (2011). 
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5.2 Trend filtering convergence rate 

Here we show that trend filtering also achieves the minimax convergence rate over Fk{C). The 
arguments used by Mammen & van de Geer (1997) for locally adaptive regression splines cannot 
be directly applied here, as they are based some well-known interpolating properties of splines that 
do not easily extend to the trend filtering setting. Our strategy is hence to show that, as n — >■ oo, 
trend filtering estimates lie close enough to locally adaptive regression spline estimates to share 
their favorable asymptotic properties. (Note that for fc = or fc = 1, the trend filtering and locally 
adaptive regression spline estimates are exactly the same for any given problem instance, as shown 
in Lemma [3] in Section [3j therefore, the arguments here are really directed towards establishing a 
convergence rate for trend filtering in the case k > 2.) Using the Cauchy- Schwartz inequality (i.e., 
using \\a + b\\l = \\a\\l + \\b\\l + 2a'^b < 3\\a\\l + 3\\b\\l), we have 

I E (A - M^^)y <lJ2i^- />o)' + ^ E (/>.) - /o(xo)^ (34) 

i—1 i— 1 i — 1 

where /3, / are the trend filtering and locally adaptive regression spline estimates in (pi), ( [T3| , respec- 
tively. The second term above is (9p(n~(^'^+^'/(^'^+'^)) by (32 1; if we could show that the first term 



above is also Op(n~*^^'^+^)/(^'^+^)), then it would follow that trend filtering converges (in probability) 
to /o at the minimax rate. 

Recall from Section[3]that both the trend filtering and locally adaptive regression spline estimates 
can be expressed in terms of the fitted values of lasso problems, 

P^Ha, {f{xi),...f{xn))=G9, 
where G,H £ M"^" are the basis matrices in ([22|), ([24|, and a, 9 are the solutions in lasso problems 



(20), (23|. Hence we seek abound for ^,"^j^(/3i — /(xi))^ = Hi/d—G^lH, the (squared norm) difference 
in fitted values between two lasso problems with the same outcome y, but different predictor matrices 
G, H. Intuitively, a tight bound would seem feasible here because G and H have such similar entries 
(again, for fc = or fc = 1, we know that indeed G = H). 

While there are existing results on the stability of the lasso fit as a function of the outcome vector 
y [e.g., Tibshirani & Taylor (2012) show that for any fixed predictor matrix and tuning parameter 
value, the lasso fit is nonexpansive as a function of y], as far as we can tell, general stability results 
do not exist for the lasso fit as a function of its predictor matrix. To this end, in Appendix iB] we 
develop bounds for the difference in fitted values of two lasso problems that have different predictor 
nratriccs, but the same outcome. The bounds are asymptotic in nature, and are driven primarily 
by the maximum elementwise difference between the predictor matrices. We can apply this work in 
the current setting to show that the trend filtering and locally adaptive regression spline estimates 
converge (to each other) at the desired rate, 7i~(2/c+2)/(2fc+3). essentially, this amounts to showing 
that the elements oi G — H converge to zero quickly enough. 



Theorem 1. Assume thaty £ K" is drawn from the model (30), with evenly spaced inputs Xi = i/n, 
i = l,...n and sub-Gaussian errors (31). Assume also that /o £ !Fk{G), i.e., for a fixed integer 



A: > and constant C > 0, the true function /o is k times weakly differentiable and TV(/q ) < C. 
Let f denote the kth order locally adaptive regression spline estimate in ( |13[ ) with tuning parameter 
A — ld{n^'^'^'''^^'), and let (3 denote the kth order trend filtering estimate in ([2]) with tuning parameter 
(1 -I- e)A, for any fixed e > 0. Then 



n 

- E (ft - />.))' = op(n-(2'=+2)/(2'=+3)). 



n 

i=l 
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Proof. We use Corollary [4] in Appendix[B] with X = G and Z ^ H. First note that our sub-Gaussian 
assumption in ( |31[ ) implies that IE[e^] < oo (indeed, it implies finite moments of all orders), and with 
M = {fo{^i)i ■ ■ ■ foixn)), we know from the result of Mammen & van de Geer (1997), paraphrased in 



( [32| ), that 

Furthermore, the locally adaptive regression spline estimate / has total variation 

TV(/)-|l^2||i = 0p(l), 

where 62 denotes the last p2 = n — k — 1 components of 9. Therefore, recalling that A = 6(n^/(^'^+'^)) 
the remaining conditions ( 53 1 needed to apply Corollary H reduce to 



^(2fe+2)/(2fc+3)||g^ - i/211^ ^ as n -^ 00, 

where G2 and H2 denote the last n—k—1 columns of G and H , respectively, and ||A||oo denotes the 
maximum absolute element of a matrix A. The above limit can be established by using Stirling's 



formula to bound the elementwise differences in G2 and H2] see Lemma 10 in AppendixO Therefore 
we apply Corollary |4] to conclude that 



||G^ - Ha\\2 = Op(\/nV(2fc+3)). 
Squaring both sides and dividing by n gives the result. D 

Remark. It may seem a little strange to choose different tuning parameter values for the locally 
adaptive regression spline and trend filtering problems (Theorem [I] chooses a tuning parameter A for 
the locally adaptive regression spline problem, and a tuning parameter (1 + e)A for trend filtering), 
given that we want to equate the fitted values of these two problems. However, it turns out that this 
"extra" amount of regularization for trend filtering is needed in order to remove the dependence of 
the final bound on the total variation of the trend filtering estimate (||q!2||ij in the notation of the 
proof). See the remark following Lemma ^ in Appendix [B| 



adaptive regression spline estimate (32)], we arrive at the following result 



Now, using the Cauchy- Schwartz inequality ( 34 ) [and recalling the convergence rate of the locally 



Corollary 1. Under the assumptions of TheoremVA for a tuning parameter value A — Q{n^i^'^^^^'), 
the kth order trend filtering estimate /3 m (pi) satisfies 



n ^-^ 



/o(x.))' = Op(n-(2'=+2)/(2'c+3)) 



1=1 
Hence the trend filtering estimate converges in probability to /q at the minimax rate. 



Remark. Mammen & van de Geer (1997) prove the analogous convergence result (32 1 for locally 
adaptive regression splines using an elegant argument involving metric entropy and the interpolating 
properties of splines. In particular, a key step in their proof uses the fact that for every fc > 0, and 
every function / : [0, 1] — >■ M that has k weak derivatives, there exists a spline g G Qk [i.e., g is a 



spline of degree k with knots in the set T, as defined in (14)] such that 



max \f{x) - g{x)\ < dfeTV(/('=))n-'= and TV(5('=)) < 4TV(/('=)), (35) 

where dk is a constant depending only on k (not on the function /). Following this line of argument 
for trend filtering would require us to establish the same interpolating properties (35) with h G Hk 



in place of g, where T-Lk, as defined in (25), (28), is the domain of the continuous-time trend filtering 
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minimization problem in (29). Because lik does not contain spline functions, but instead functions 
that can have discontinuous lower order derivatives at the input points xi,...Xn, proving such an 
interpolating result would be quite a difficult task. We circumvented this difficulty by proving that 
trend filtering estimates converge to locally adaptive regression spline estimates at a rate equal to the 
minimax convergence rate (Theorem [l]), therefore "piggybacking" on the locally adaptive regression 
splines rate due to Mammen & van de Geer (1997). 

5.3 Functions with growing total variation 

We consider an extension to estimation over the function class Fk{Cn), where now C„ > is not 
necessarily a constant and can grow with n. As in the last section, we rely on a result of Mammen & 
van de Geer (1997) for locally adaptive regression splines in the same situation, and prove that trend 
filtering estimates and locally adaptive regression spline estimates are asymptotically very close. 



Theorem 2. Assume that y e M" is drawn from the m,odel (30 1, with inputs Xi — i/n, i — 1, . . . n 
and sub-Gaussian errors (31). Assume also that /q G J'k{Cn), i-e., for a fixed integer fc > and 
Cn > (depending on n), the true function /o is k times weakly differentiate and TV(/o ) < C„. 



Let f denote the kth order locally adaptive regression spline estimate in (13) with tuning parameter 
A = 0(n^' '^'^+'^'C„ ), and let f3 denote the kth order trend filtering estimate in ([2]) with 



tuning parameter (1 + e)A, for any fixed e > 0. // C„ does not grow too quickly, 

C„ = 0(„(fe+2)/(2fe+2))^ 



(36) 



the 



1 " 

-Y,{^- f{^^)f = Op(n-(2fc+2)/(2fc+3)c2/(2fc+3)). 



i=l 



Proof. The arguments here are similar to the proof of TheoremIT] We invoke Theorem 10 of Mammen 
& van de Geer (1997), for the present case of growing total variation C„: this says that 



n 
- E (/(•^*) - foi^^)T = Op(n-(2'=+2)/(2'=+3)p2/(2fc+3))^ 



(37) 



and also TV(/) = ||^2||i = Op(C„). Now the conditions for Corollary UJ in Appendix |b| reduce to 

^(2fe+2)/(4fc+6)^(2fc+2)/(2fc+3)||g,^ _ ^^||^ ^ 0(^)^ (gg) 

^(2fc+2)/(2fc+3)||g^ - i/2||co ^ as n ^ oo. (39) 



As before, the second condition (39) is implied by Lemma 10 in Appendix O The first condition 
(38) is actually also implied by Lemma |10[ because after applying the assumption (36) on C„, it 
suffices to show that n\\G2 — ^2|loo — 0(1), which is shown in the proof of Lemma 10 Therefore, 
we conclude using Corollary [4] that 



n r- 

\ i=i 



„l/(2fc+3)c2/(2fc+3) 

which gives the rate in the theorem after squaring both sides and dividing by n. 



a 



Finally, we employ the same Cauchy-Schwartz argument ( 34 ) [and the locally adaptive regression 



splines result (37) of Mammen & van de Geer (1997)] to derive a rate for trend filtering. 
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Corollary 2. Under the assumptions of Theorem\A for C„ — 0{n^^^'^''^'^^'^'^') and a tuning param- 
eter value A = Q{n^/^'^^^^'Cn ), the kth order trend filtering estimate /3 m (pi) satisfies 

n 

4=1 

Remark. Although we manage to show that trend filtering achieves the same convergence rate as 
locally adaptive regression splines in the case of underlying functions with growing total variation, 
we require the assumption that C„ grows no faster than 0(n('^+^)/(^'^+^)), which is not required for 
the locally adaptive regression spline result proved in Mammen & van de Geer (1997). But it is 
worth pointing out that for A: = or A: = 1, the restriction C„ = 0(n('^+^^/(^'^"'"^^) for the trend 
filtering convergence result is not needed, because in these cases trend filtering and locally adaptive 
regression splines are exactly the same by Lemma [3] in Section [31 

6 Extensions and discussion 

We have shown that trend filtering, a newly proposed method for nonparametric regression of Kim 
et al. (2009), is both fast and locally adaptive. Two of the major tools for adaptive spline estimation 
are smoothing splines and locally adaptive regression splines; in short, the former estimators are fast 
but not locally adaptive, and the latter are locally adaptive but not fast. Trend filtering lies in a 
comparatively favorable position: its estimates can be computed in 0{n^''^) worst-case complexity 
(at a fixed value of the tuning parameter A, using a primal-dual interior point algorithm), which 
is slower than the 0{n) complexity of smoothing splines, but not by a big margin; its estimates 
also achieve the same convergence rate as locally adaptive regression splines over a broad class of 
underlying functions (which is, in fact, the minimax rate over this class). Trend filtering does have 
its own limitations, namely, it assumes that the input points are evenly spaced. We briefly discuss 
an extension to the unevenly spaced case in Section |6.H but really, this remains a topic for future 
work. 

One way to construct trend filtering estimates, conceptually, is to start with the lasso form for 



locally adaptive regression splines (201, but then replace the matrix G in (22), which is generated 
by the truncated power series, with the matrix H in ( |27[ ), generated by something like their falling 
factorial counterparts. This precisely defines trend filtering, and it has the distinct computational 
advantage that H has a sparse banded inverse (whereas the inverse of G is dense). Moreover, the 
matrix H is close enough to G that trend filtering estimates retain some of the desirable theoretical 
properties of locally adaptive regression splines, i.e., their minimax rate of convergence. While 
this change-of-basis perspective is helpful for the purposes of mathematical analysis, the original 
representation for trend filtering ([2]) is certainly more natural, and also perhaps more useful for 
constructing related estimators whose characteristics go beyond (piecewise) polynomial smoothness 
of a given order. This is discussed in Section [6?2} 



6.1 Unevenly spaced inputs 

Trend filtering implicitly assumes that the inputs xi, . . . x„ are evenly spaced. [The underlying range 
is not really important, and only affects the scaling of the tuning parameter A; for a common spacing 
of d > between inputs, if we wanted to compare the trend filtering problem in ([2]) with, say, the 



locally adaptive regression spline problem in (13) across A values, then we would simply replace the 
factor of n*^ in ^ by l/d^.] How could we extend the trend filtering criterion in (pi) to account for 
arbitrarily spaced inputs? One nice feature of the continuous-time representation of trend filtering 



in ( 29 ) is that it provides a natural answer to this question. 



For arbitrary input points xi < X2 < ■ ■ ■ < Xn, consider defining the basis matrix i? as in (25 1 



(26), and defining the trend filtering estimate by the fitted values Ha of the problem in (23). Aside 
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from its connection to the continuous-time trend filtering problem, this definition is supported by 
the fact that the trend filtering estimates continue to match those from locally adaptive regression 
splines for polynomial orders fc = or 1, as they did in the evenly spaced input case. [This follows 



from the fact that for /c = or 1, the basis functions ft-i, . . . ft.„ defined in (25) match the truncated 



power basis functions gi,. . .gn in (19), with knots as in (12).] Now, to express this definition in a 



more familiar form, the goal is to find a matrix £)(^''=+i) g m(" *^ i)><" so that the estimate 

/3 = argmin hy - ^Wj + 1 • A||i?(-^'=+i)/3||i (40) 

/3eR" 2 fc! 

satisfies (3 — Ha. Note that such a matrix £)(^''=+i) is necessarily defined by the last n — fc — 1 rows 
of H^^ . For fc = 0, it is easy to see that _d(^'1) = D^-^\ the first difference matrix given in (Is]). For 
fc = 1, a short calculation shows that 

7^(-,2)^^(i).diag(i/d2,...lM0.i?(i), 

where di = Xi — Xi^i, i = 2, . . .n are the spacings between inputs [and the leading D^^' above is the 
{n — 2) X {n — 1) version of the first difference matrix]. This parallels the definition of the second 
difference matrix in ([4]), and hence (as our notation would suggest), £)(^'2) can still be thought of as 
a second difference operator, but adjusted to account for the unevenly spaced inputs xi, . . . x„. For 
fc > 2, the calculations become more complicated and it is not clear that a simple recursive formula 
like the one above holds for £)(^''=^+i) (in terms of the inverse spacings and the usual discrete difference 
operators). Empirically, however, the matrix dC^^'^+i) (as given by inverting H and looking at the 
first n — fc — 1 rows) exhibits the appropriate (banded) structure and sign pattern to support the 
interpretation of _d(^''=+i) as an adjusted (fc + l)st order discrete difference operator. 

A detailed study of trend filtering in the case of arbitrary inputs is a direction for future work. 
From a computational perspective, having a closed-form expression for Z)(^''=+i) is of great interest, 



since the generalized lasso form ( 40 1 of trend filtering can be solved much more efficiently than the 



lasso form (23) (assuming that in^'^+i) is banded and H is dense), as discussed in Section 3.5 On 
the theoretical side, the proof given in Section [5] for the minimax convergence rate of trend filtering 
estimates could potentially be extended to the unevenly spaced case. As locally adaptive regression 
splines maintain their minimax rate for arbitrarily spaced inputs, it remains to bound the difference 
between the truncated power and trend filtering basis matrices G and H in this case (as in Lemma 
10 in Appendix [C] for the evenly spaced case). 

6.2 Synthesis versus analysis 

Synthesis and analysis are concepts from signal processing that, roughly speaking, describe the acts 
of building up an estimator by adding together a number of fundamental components, respectively, 
whittling down an estimator by removing certain undesirable components. The same terms are also 
used to convey related concepts in many scientific fields. In this section, we compare synthesis and 
analysis in the context of ii penalized estimation. Suppose that we want to construct an estimator 
of y e M" with some particular set of desired characteristics, and consider the following two general 
problem formulations: 

S^lly-^^ll' + ^ll^lli (41) 

mmhy-/3\\l + X\\Df3h. (42) 



The first form (41 ) is the synthesis approach: here we choose a matrix X G M"^^ whose columns are 



atoms or building blocks for the characteristics that we seek, and in solving the synthesis problem 



(41 ), we are adaptively selecting a number of these atoms to form our estimate of y. Problem (42 1 
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on the other hand, is the analysis approach: instead of enumerating an atom set via X, we choose a 
penalty matrix D G M™^" whose rows represent uncharacteristic behavior. In solving the problem 



(42 1, we are essentially orthogonalizing our estimate with respect to some adaptively chosen rows of 
D, therefore directing it away from uncharacteristic behavior. 

The original representation of trend filtering in (pi) falls into the analysis framework, with D 



£)(A;+i)^ the {k + l)st order discrete difference operator; its basis representation in (23) falls into the 
synthesis framework, with X = H, the falling factorial basis matrix (an unimportant difference is 
that the li penalty only applies to part of the coefficient vector). In the former, we shape the trend 
filtering estimate by penalizing jumps in its (k + l)st discrete derivative across the input points; in 
the latter, we build it from a set of basis functions, each of which is nonsmooth at only one different 



input point. Generally, problems (41) and (|42|) can be equated if D has full row rank (as it does 
with trend filtering), but not if D is row rank deficient [see Tibshirani & Taylor (2011), Elad et al. 
(2007)]. 

Here we argue that it can actually be easier to work from the analysis perspective instead of the 
synthesis perspective for the design of nonparameteric regression estimators. (The reverse can also 
be true in other situations, though that is not our focus.) For example, suppose that we wanted to 
construct an estimator that displays piecewise polynomial smoothness across the input points, but 
additionally, is identically zero over some appropriately chosen subintervals in its domain. It helps 
to see an example: see the left panel in Figure |9] Working from the analysis point of view, such an 
estimate is easily achieved by adding a pure ii penalty to the usual trend filtering criterion, as in 

/3 = argmin ^||y - (3\\l + X,\\D^''+'^ I3\\, + A2||/3||i. (43) 

/3GR" ^ 



We call (43) the sparse trend filtering estimate. This could be of interest if, e.g., y G E" is taken to 
be the pairwise differences between two sequences of observations, e.g., between two response curves 
over time; in this case, the zeros of /3 indicate regions in time over which the two responses are 
deemed to be more or less the same. It is important to note that an estimate with these properties 
seems difficult to construct from the synthesis perspective — it is unclear what basis elements, when 



added together, would generically yield an estimate like that in ( 43 ) . 

As another example, suppose that we had prior belief that the observations y G M" were drawn 
from an underlying function that possesses different orders of piecewise polynomial smoothness, ki, 
k2 , at different parts of its domain. We could then solve the mixed trend filtering problem, 

p = argmin hy - ^g + Aip('=^+i)/3|ji + X2\\D^'^+''> ^h- (44) 

The right panel of Figure |9] shows an example, with an underlying function that is mixed piecewise 
quadratic and piecewise constant. Again it seems much more difficult to construct an estimate like 



(44), i.e., one that can fiexibly adapt to the appropriate order of smoothness at different parts of its 
domain, using the synthesis framework. Further study of the synthesis versus analysis perspectives 
for estimator construction will be pursued in future work. 
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Sparse trend filtering 








Mixed trend filtering 
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Figure 9: Left panel: a small example of sparse quadratic trend filtering (k = 2). The estimate f3 in 
(431 is identically zero for inputs approximately between and 0.25, and 0.6 and 0.75. Right panel: 



an example of constant/ quadratic mixed trend filtering (ki = and fc2 = 2j. The estimate defined 
in ( 44 ) is first piecewise quadratic over the first half of its domain, but then is flat in two stretches 



over the second half. 



A Lasso and continuous- time representations 

A.l Proof of Lemma [2] 

Consider making the variable transformation a — n^ /k\ ■ Df3 in (pi), with D £ M"^" defined as 



D 



(0) 



D 



i:)(fe+i) 



and Dl € M^^" denoting the first row of the ith discrete difference matrix I?^*', for i = 0, . . . fc. We 



first show tliat D 



A'T where M = Af (°) 



Af^*^) and 



m(*) 



/, 







-'^(n— i)x(n— i) 



for i — Q, . . .k. 



Here /^xi is the ixi identity matrix, and L(ji-i)yc{n-i) is the {n — i) x [n — i) lower triangular matrix 
of Is. In particular, we prove that M~^ = D by induction on k. When fc = 0, that 
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can be seen directly by inspection. Assume now that the statement holds for fc — 1. Then 



I 
L-i 



D 



(0) 



(fc-i) 






(0) 



D'l 






where we have abbreviated / — h.xk and L = L 
inductive hypothesis. Moreover, 



L-^D^k) 



kxk 

1 
-1 







1 

-1 1 







^(ti— fc)x(n— fe) 

■ 




-1 1 



, and in the second equality we used the 



D(k) 



D 

jj(kA 



(k) 



completing the inductive proof. Therefore, substituting a — ■n!' /k\ ■ Dj3 in ^ yields the problem to 
the problem 

. 1 



a = argmm 



y rMa 



j=k+2 



It is now straightforward to check that the last n — fc — 1 columns of {k\/n^)M are the same as those 



of H , as defined in (24 1 in the lemma. Further, the first fc + 1 columns of {k\/n^)M have the same 



linear span as the first fc + 1 columns of H , which is sufficient because the ^i penalty above [and in 



(23)] only applies to the last n — k — 1 coefficients. 



n 



A. 2 Proof of Lemma |4] 



Define s. 



(0) 



1 for all i, and 



(fc) 






,(fe-i) 



for fc = 1,2,3, . 



Jk) 



i.e., si is the fcth order cumulative sum of (1, 1, ... 1) e R\ with lag 1. We will prove that 



{x- k){x- fc + 1) 



(x-1) 



fc! 



s^^^ for aU X = 1, 2, 3, . . . and fc = 1, 2, 3, 



(45) 



by induction on fc. Note that this would be sufficient to prove the result in the lemma, as it would 



show that the bottom right (n — fc — 1) x (n— fc — 1) sub-block oi H in (24), which can be expressed 
as 



fc! 



Sk) 



-'fc+i 

*fc+2 *fc+l 







Sk) 



,(fc) 



'n-1 ■'n-2 






'fc+1 
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is equal to that in (27 1. Here is now give the inductive argument for (45 1. The base case, for fc = 1: 



,(1) 



clearly x — 1 = Sx for all x. Assume that the inductive hypothesis holds for k. Then for any x, 



=(fc+i) 



E 



(i - k){i - k + 1) • ... • (i- 1) 



z=l 
x—1 i—1 



fc! 



^^^ U - k + l){j - k + 2) ■ . . . ■ jj - 1) 

^^ (k-iy. 

i=l j=l ^ ' 

by the inductive hypothesis. Switching the order of the summations, 

x-2 x-\ 

U-fe + iju 

(fc-1) 

(j-fc + l)(.?-fc + 2)-...-(j-l) 



(fc+i) ^ V V 0'-fc + i)0'-fc + 2)-----(j-i) 



x-2 

E 

""'(j-fc + l)(j-fc + 2).....(j-l) 



(fc-1) 



E 



(fc-1)! 
Grouping terms and again applying the inductive hypothesis. 



(a; — fc — 1 — J + fc) . 



(fc+i)^ (a:-fc-l)(a;-fc-2)-....(a;-2) 

fc! ■ *^' 



a; — fc — 1) — Sx^x ■ fc. 



Noting that s 



(fe+i) _ ^(fc+i) 



(x — fc — 1) • . . . • (a; — 2)/fc!, and rearranging terms finally gives 
(fc+i) _{x-k- 2)(x - fc - 1) • .... (a; - 2) 



^-^ (fc + 1)! 

Since x was arbitrary, this completes the inductive step, and hence the proof. 



a 



B Bounding the difference in lasso fitted values 

B.l Lasso problems in standard form 

Consider two lasso problems sharing the same outcome vector y e M", 

mm i||y-Za||2 + A'||a||i, 



(46) 

(47) 



where X, Z e M"^^, and A, A' > 0. One might ask: if A, A' are chosen appropriately, can we bound 



the difference in the fitted values XQ and Za. of ( 46 1 and ( 47 ) , respectively, in terms of the difference 



between X and Zl This question can be answered by first deriving a "basic inequality" , much like 
the one used for bounding lasso prediction error. 



Lemma 6. For fixed A, A', solutions 9 and a of lasso problems (46) and (471 satisfy 



-\\X6-Za\\l < {y~Xe,Za) + iX' -X)\\9\\i - A'||d||i + i?. 



(48) 



where R = \\\{X - Z)e\\l + {y - X9, {X - Z)9) . 
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Proof. Note that by optimality, 

l\\y-Za\\l + X'\\a\h<l\\y-Z9\\l + y\\e\\, 

= l\\y-xe\\l + \'\\e\\^ + R, 

where R — ||ly — ^^111 ~ |lly ~ ^^lli- We can rearrange the above inequahty as 

^\\Za\\l - ^\\Xe\\l < {y,Za - X6) + \'\\6\\, - \'\\a\\, + R. 

Writing y = XO + (y — X0) within the inner product on the right-hand side above, and bringing the 
term involving X9 over to the left-hand side, we have 

hxe- Za\\l < {y - xe, Za - xe) + x'\\e\\i - X\\a\\i + R, 

^(y- Xe, Za) + (A' - A)||^||i - A'||a||i + i?, 

where in the last line we used the fact that (y — X6,Xd) = A||6'||i, from the KKT conditions for 
problem ( 46 ) . Lastly, we rewrite 



R = l\\ze\\l-^\\x9\\l + {y,x§-ze) 



1, 



xe - ze\\l + {y- xe, xe - ze), 



which completes the proof. D 

In order to bound \\X6 — Za\\2, the goal now is to determine conditions under which the right- 



hand side in (481 is small. Note that both terms in R involves the difference X — Z, which will have 



small entries if X and Z are close. The second term in (48 1 can be controlled by taking A' and A to 



be close. As for the first term in (48 1, we can rewrite 



{y - xe, Za) ^ {y- Xe, [Z - X)a) + {X'^ {y - Xe),a); 
above, the first term again involves the difference X — Z, and the second term can be balanced by 



the term — A||q;||i appearing in (48) if A and A' are chosen carefully. These ideas are all made precise 
in the next lemma. 



Lemma 7. Consider a sequence of lasso problems ( 46 1 , ( 47 ) (all quantities p,y,X, Z,\,\' considered 



as functions of n), such that A' = (1 + e)A for some fixed e > 0. Assume that 



^\\x-z\\^\\eh = o[d\\\eh 



^p\\x - z\u\y - xeh ., ^ ,... 

and ; > U as n —>■ oo. (49) 



A 



Then any solutions e,a of (461, (47) satisfy 



\\xe-Za\\2 = o{\J\\\e\\i). (50) 

Proof. As suggested in the discussion before the lemma, we rewrite the term (y — XO, Za) in the 



right-hand side of ( 48 ) as 



{y - xe, Za) = {y- Xe, [Z - X)a) + {X'^{y - Xff), a) 
<||y-X^||2||(X-Z)a||2 + A||a||i 



< y/j>\\y - xehwx 



|d||i + A||a||i, 
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where in the second hne we used Holder's inequahty and the fact that WX"^ [y — X6)\\oo < A from the 



KKT conditions for (46), and in the third Hne we used the bound ||^a;||2 "£ \/p\\M\oo\\x\\i for a matrix 
A G ]R"^P, where ||^||oo denotes the maximum element of A in absolute value. The assumption that 
y^||X — Z||oo||y — X9\\2/\ -^ now implies that 

{y- Xe,Za) < eA||a||i +A||d||i = (1 + e)A||a||i. 



Plugging this into the right-hand side of (48), and using A' = (1 + e)A, we see that 

1, 



1X61 



Finally, 



^<2(VpII^ 



Za\\l < eX\\0\ 



Vp\\x 



R. 



\v-xe\\2\\e\\ 



and using both conditions in (49), we have R = O(A||0||i), completing the proof. 



n 



Remark. Had we instead chosen A' = A, which may seem like more of a natural choice for pairing 



the two problems ( 46 1 , ( 47 ) , the same arguments would have yielded the final bound 

\\Xe - Z&h = 0(^Amax{||^||i,||a||i}' 
For some purposes, this may be just fine. However, the envisioned main use case of this lemma is 



one in which some (desirable) theoretical properties are known for the lasso problem ( 46 1 with a 
particular predictor matrix X, and analogous results for a lasso problem (47) with similar predictor 



matrix Z are sought (e.g., this is the usage for locally adaptive regression splines and trend filtering); 
in such a case, a bound of the form (50) is preferred as it does not depend at all on the output of 



problem (47) 



The second condition in (49 1 involves the quantity y — XO, and so may appear more complicated 



than necessary. Indeed, under weak assumptions on y and X6, this condition can be simplified. 



Corollary 3. Consider again a sequence of lasso problems (46), (47) such that A' = (1 + e)A for 
some fixed e > 0. Assume that the outcome vector y is drawn from the regression model 

where ei, . . . e„ are i.i.d. with E[ef] < oo, and assume that ||/i — X0||2 ~ Op(\/n). Further assume 

y^\\X - Z\\oo\\0\\i ^ Or(Jx\\§\\i) and y^||X - Z||oo/A ^ asn^oo. 



■(\/^ 



Then any solutions 6, a of (46), (47) satisfy 

\\Xe-Za\\2^0i 
Proof Note that 

Both terms on the right-hand side above are Op(-y/n); for the second term, this is true by assumption, 
and for the first term, wc can use the fact that ei, . . . e„ are i.i.d. with finite fourth moment to argue 



En 
»=1 



E\e'] > 1 ] < ^^ 



0, 



where we used Chebyshev's inequality. Therefore ||y — X9\\2 — Op{^/rl), and to show that ^/p\\X — 
Z\\oo\\y - X§\\2/X -^ in probability, it suffices to show ^/np\\X - Z\\oo/X -^ 0. D 
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Remark. The fourth moment condition on the errors, E[e^] < oo, is not a strong one. E.g., any 



sub-Gaussian distribution [which was our distributional assumption in (311 for the theoretical work 
in Section p] has finite moments of all orders. Moreover, the assumption ||/i — X0||2 = Op{^/n) is 
also quite weak; we only maintain that the average prediction error j|/i — X9\\2/^/n is bounded in 
probability, and not even that it converges to zero. 



B.2 Lasso problems in nonstandard form 

Now consider two lasso problems 



mm-||y-X0||2 + A||02||i, 

^l^„ olly-^"ll2 + ^'ll"2||i, 



(51) 
(52) 



where the coefficient vectors decompose a.s 6 = {61,92), a ^ {ai,a2) G MP'^^^^^, and we partition the 
columns of the predictor matrices accordingly. 



x = [Xi X2], z=[Zi Z2]e 



Bnx(pi+p2) 



In words, the ii penalties in (51 1, (52) only apply to part of the coefficient vectors. We have the 



same goal of the last section: to bound \\X9 — Za\\2 in terms of the differences between X and Z. 
Simply transforming (51 1, ( [52| ) into standard form lasso problems (by partially solving for 9i, ai) 
will not generically provide a tight bound, because the predictor matrices of the resulting lasso 
problems have restricted imagesFI Therefore, we must rederive the analogs of Lemmas rol and [TJ and 
Corollary |3] For the upcoming bounds in Lemma |9] and Corollary [4J it is critical that Xi = Z\, 
i.e., the predictor variables corresponding to unpenalized coefficients in (|5ip, (52) are identical. We 



state the results below, but omit the proofs because they follow essentially the same arguments as 
those in the last section. 



Lemma 8. For fixed A, A', solutions 9 and a of lasso problems (51) and (52) satisfy 
hx6 - Za\\l <{y- X9,Za) + (A' - A)||^2||i - A'||a2||i + R, 



where R ^ ^\\{X - Z)9\\^ + {y - X9, {X - Z)9) . 



Lemma 9. Consider a sequence of lasso problems (51 ), ([52]) (where all quantities pi,p2,y, X, Z, A, A' 
are considered as functions of n), such that Xi = Zi and A' = (l-|-e)A for some fixed e > 0. Assume 
that 



'P2\\X2 - Z2\ 



O \ \ 



and 



f)2\\X2-Z2\\oo\\y-X9\[ 



Then any solutions 6, a of (46), (47) satisfy 



\X9 - Za\U = 0[\ \ 



A 



as n -^ 00. 



''In more detail, solving for X\di = Px^ {y - X282) and ZiSi = Pz^ {y - .^202) [where P4 = A{A^ A)+ A^ denotes 
the projection matrix onto col(A), the column space of a matrix A\, yields "new" standard form lasso problems with 
predictor matrices (/ — Pxi)^2 and (/ — Pzi)Z2- This is problematic because we require a bound in X2 and Z2. 
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Corollary 4. Consider again a sequence of lasso problems (51 1, (52 I with Xi = Zi and X' = (l + e)A 



for some fixed e > 0. Assume that the outcome vector y is drawn from the model 

where ei, . . . e„ are i.i.d. with E[ef] < oo, and assume that \\fi — X6\\2 ~ Op{-^n). Further assume 



'p^||^2-^2||oo||e2||i=Op V^ll^2||i and ^j^\\X2-Z2\\oo/X^0 asn^oo. (53) 



Then any solutions 6, a of (511, (52) satisfy 



\\X9 - Za\\2 = Op{^^ XlMi 

C Convergence of trend filtering and locally adaptive regres- 
sion spline basis matrices 

Lemma 10. For any integer k > 0, consider the matrices G2, H2 € l]j"x("-'=~i)^ f/jg last n — k — 1 



columns of the trend filtering and locally adaptive regression spline basis matrices in (22), (27). With 
evenly spaced inputs on [0, 1], {xi, X2, ■ ■ ■ Xn} — {^/n, 2/n, . . . 1}, we have 



i'\\G2-H2\\^,^0 



as n ^ 00, 



(54) 



for any q < I. 



Proof. For fc = 0, 1 the result is vacuous, as G2 — H2 according to Lemma [3] in Section [3] Hence we 
assume fc > 2, and without a loss of generality, we assume that k is even, since the case for k odd 
follows from essentially the same arguments. By Lemma [Til for large enough n. 



\G2-H2\\^^\[\[{n-l-i)- (n-l 



\i=l 



fc + 2 



and therefore 



IG2 — i?2|loo — 



n-l 



fc+2 ^ 
2 ) 



nti("-i 



in- 1- 



fe+2\* 



-1 



K 

It is clear that a„ — > 1 as n — >■ cx). Hence it remains to show that &„ — > as n — >■ cx). 
To this end, consider the term 



\{{n^l-i) 

i=l 



{n~k-iy: 



We use Stirling's approximation to both the numerator and denominator, writing 

(n-l)! _ (n-l)!/((n-l)«-V2e-»+iy2^) (n-l)"-i/2 



(n-fc-1)! (n-A:-l)! /((n-fc-l)"-fc-i/2e-»+fe+iy2^) (n - fc - 1)" ^ 1/2 
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Therefore 



1 — 71 I 1 /o 



("-1-^) {{n-k-l)/{n-l-^)) 



0+ 


n 


-fc-1 J 


-i/- 


g-(fe-2)/2 


(' 


- 


(fc+2)/2\"" 


-1/2 


e(fc+2)/2 



dn 

Note that c„ -> 1 by Stirhng's formula, and (i„ — > 1 by the well-known limit for e^ 



lim 

i— >-C30 



(1 + 9*. (55) 



The question is of course how fast these two sequences c„, d„ converge; if the remainder c„(i„ — 1 is 
o{n^'^), then 6„ -> and indeed n''||G2 - fl"2||oo —> 0. 

First we address c„. It is known that Stirling's approximation satisfies [e.g., see Nemes (2010)] 

n\ ^ , 1 1 

e^", where — — — - < 7„ < 



„n+l/2g-ny27r 1271+1"' 12n 

Hence 

c„ = exp(7„_i - 7„_fc-i) < exp ( -— — 

\12(n — 1) 



Next we address d„. Lemma 12 derives the following error bound for the exponential limit in (55): 

/ X \^ r — X 

1 + - e"^ = e*"'", where < 4 n < 0, 

\ nJ n -\- X ' 

for sufficiently large n. Therefore 

/-^ ^ (fc-2)/2 \ 1/2 
d„ = exp ((5(fe_2)/2,n-fc-l - '5-(fc+2)/2,n-l) ' ( ^ (l+2)/2 j 

= exp ((5(fe_2)/2,n-fc-l - '5-(fc+2)/2.n-l) ' ( _ ^ _ -^ j 

, (A: + 2)2 \ / n-1 ^ ^^^ 
< exp ' ' ' 



^4n-2(fc + 4)y V"'-^-l, 

We can simplify, for large enough n, 

1 {k + 2f ^ (fc + 2)2 

12(n-l) 4n-2(A; + 4) - n 

and putting this all together, we have 



5„ = n'^ic^dr. - 1) < n' fexp (^^^) • ( ^"^^ J '^' - l] • 

The bound on the right-hand side above converges to zero after an application of I'Hopital's rule. 
(We note that here it is critical to have g < 1; if g = 1 then the right-hand side above converges to 
a positive constant, and for g > 1 it diverges.) As fe„ > (recall that fe„ = rt'||G2 — H2\\oo/o-n and 
a„ > 0), this completes the proof. D 
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Lemma 11. Let k >2. If k is even, then 



fe / 1 . n\ k 



\\G,-H,\U^^^(f[in-l-^)-|^n-l-'^ 

for sufficiently large n; if k is odd, then 

l|G2-i?2|U = ^(^(n-l-^)-(n-l-^y^, 

for sufficiently large n. 

Proof. We prove the result for k even; the proof for k odd is similar. Consider first the sequence 

' ( (fc + 2) V 

i=l ^ ^ 

Note that 

fk{k + 2) k{k + l)\ fe_i ^, fc_2N 
an = y 2 ~ 2 j • " + ^(^ )• 

Because the coefficient of the leading term is positive, we know that a„ — > oo as n — >■ oo; further, for 
large enough n, this convergence is monotone (since a„ is polynomial in n). Now recall that G2,H2 



are the last n — k — 1 columns oi G,H, in (22 1, (27). Hence 



1 



a i<j + {k + 2)/2 



tti-j a i > j + k, 



{H2 - G2)^J = :^- {-{i-i-{k + 2)/2Y if i + {k + 2)/2 < z < j + fc 



given by taking j + fc + 1 in place of j in (22 1, (27). For i — j < k, the term (i — j — (fc + 2)/2) is 



bounded by k . And since a„ f oo, as argued above, we conclude that \\G2 — H2\\oo = an-i/n for 
sufficiently large n. D 

Lemma 12. For any x G M, and /or sufficiently large t > 0, 

Proof Define /(i) = (1 + x/t)*. Consider 

log/(t)-t-(log(f + a:)-logi), 

which is well-defined as long as t > —x. Because log is a concave function, its tangent line is a global 
overestimate of the function, that is, 

log(i + x) <logt+ - ■ X, 

which means that t ■ (log(t + x) ~ logi) < x, i.e., f{t) < e^ . Hence f{t)e^^ — e''^* where 5x,t < 0. 
The lower bound on 6x.t follows similarly. Again by concavity, 

logi<log(t + a;) + ^— -(-a;), 
t + x 

so log(i-|-a;) — logi > x/{t+x), and f{t) > exp{tx/{t + x)). Therefore f{t)e~^ > exp{tx/{t+x) — x) = 
exp{-x'^/{t + x)). a 
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